Welcome

Language models, which use n-dimensional embeddings trained by ML algorithms, are fundamental to the infrastructure of NLP applications such as machine translation tools, plagiarism detection, and information retrieval. Because language models have grown so large in recent years, now reaching billions or even trillions of parameters, it is important to explore to what extent abbreviated versions of these models are intuitive or ‘human-readable’. Using dimensionality reduction and k-means clustering on fastText pre-trained word embeddings for a subset of spoken Finnish words[@jmyrberg_2019], we:

  1. Explore language aspects the reduced-dimensional model is capable of capturing,
  2. Describe patterns evident after clustering the reduced model,
  3. Discuss limitations of our methods and data, and
  4. Inform the reader of directions for further exploration.

For English-speaking readers, the Finnish words have been roughly glossed with Google Translate output. Please note that, while this is useful for observing general patterns, the output contains errors.

The Data

The most frequent Finnish verb lemmas were extracted from the Kven lemma frequency list for Finnish compiled by researchers at UiT The Arctic University of Norway []. Isolating a single part of speech and using only lemmas allows us to see whether semantic likeness, rather than word form, results in intuitive clusters. Note that the list was automatically generated and contains some errors.

Next, I loaded the corresponding fastText[@jmyrberg_2019] 300-dimensional word vector for each lemma and added a Google Translate gloss to compensate for my limited Finnish vocabulary.

Dimensionality Reduction

Here, we explore multiple parameterizations of PCA, tSNE, and UMAP.

PCA

Two methods of Principal Component Analysis (PCA) are Spectral Decomposition (SD) and Singular Value Decomposition (SVD). SD examines the covariance and correlation between parameters (embedding columns), while SVD examines the covariance and correlation between tokens (rows). Here we use SVD for its higher numerical accuracy [@sthda_2017]. Initially, we examine scree plots, which tell us how much variance is captured by each dimension of the reduced models.
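The quantity a scree plot displays can be made concrete with a toy example. The analysis in this document uses R; as a language-agnostic sketch with made-up 2-D data, the following pure-Python snippet computes each principal component's share of total variance from the eigenvalues of the covariance matrix:

```python
import math

# Toy 2-D dataset standing in for two embedding columns (made-up values).
data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2),
        (3.1, 3.0), (2.3, 2.7), (2.0, 1.6), (1.0, 1.1)]

n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n

# Sample covariance matrix of the centered data.
cxx = sum((x - mean_x) ** 2 for x, _ in data) / (n - 1)
cyy = sum((y - mean_y) ** 2 for _, y in data) / (n - 1)
cxy = sum((x - mean_x) * (y - mean_y) for x, y in data) / (n - 1)

# Eigenvalues of the symmetric 2x2 covariance matrix (quadratic formula).
trace = cxx + cyy
det = cxx * cyy - cxy ** 2
disc = math.sqrt(trace ** 2 / 4 - det)
lam1 = trace / 2 + disc  # variance captured along PC1
lam2 = trace / 2 - disc  # variance captured along PC2

# Each eigenvalue's share of the total variance is one bar of the scree plot.
print(f"PC1: {lam1 / (lam1 + lam2):.1%} of variance")
print(f"PC2: {lam2 / (lam1 + lam2):.1%} of variance")
```

Because the two toy columns are strongly correlated, PC1 captures nearly all of the variance; a flat scree plot would instead indicate that no low-dimensional summary is adequate.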

Scaled PCA (dark blue) captures less of the variance in the first two dimensions in this case. Setting scaled=TRUE standardizes each parameter to unit variance, so all parameters have the same standard deviation. Since fastText parameters are already on the scale [-1, 1], this step was not strictly necessary; with variables of differing means and variances, however, it would be essential. Let’s take a look at the scatterplots:

tSNE

In t-Distributed Stochastic Neighbor Embedding (tSNE), a probability distribution over token pairs is calculated such that similar pairs (with low Euclidean distance) are assigned high probability, and dissimilar (high-distance) pairs are assigned low probability. The algorithm uses gradient descent to minimize the relative entropy (KL-divergence) between the reduced-dimensional model and the original. Much of tSNE’s power comes from how it adapts its search bandwidth to the density of the data: the perplexity value tells the model how far to look for nearest neighbors when constructing the distribution.
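The objective being minimized can be illustrated directly. This is a hedged toy sketch in Python (not the document's R code), computing the KL-divergence between two small made-up distributions over neighbor pairs:

```python
import math

# Hypothetical neighbor-pair probabilities: P from the high-dimensional
# space, Q from a candidate low-dimensional embedding (made-up values).
P = [0.5, 0.3, 0.15, 0.05]
Q = [0.4, 0.3, 0.2, 0.1]

# Relative entropy KL(P || Q): the cost tSNE's gradient descent minimizes.
# It is 0 only when Q reproduces P exactly, and it penalizes most heavily
# the misplacement of high-probability (i.e. similar) pairs.
kl = sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)
print(f"KL(P || Q) = {kl:.4f}")
```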

Although tSNE is generally considered more powerful than PCA, it is essential to know how its parameters affect the outcome [@wattenberg_viegas_johnson_2016]: patterns and clusters may look more defined than they are meaningful. The learning rate and maximum number of iterations affect how well gradient descent optimizes the model: we have to find a good balance that produces generalizable results without overfitting.

By default, Rtsne runs an initial PCA step, which helps speed up the algorithm on large datasets. Here, we try it both with and without this step for comparison. Setting a high perplexity will bring out more global structure (similar to UMAP), but takes much longer to converge [@understanding_umap_2022]. We run two rounds, with high perplexity = (nrows-1)/6 and low perplexity = high perplexity / 6.
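Perplexity has a concrete interpretation worth keeping in mind when choosing these values: it equals 2 raised to the Shannon entropy of a point's neighbor distribution, i.e. the effective number of neighbors the distribution spreads its mass over. A minimal Python illustration (made-up distributions, not the document's R code):

```python
import math

# Perplexity is 2**H(P): the effective number of neighbors over which a
# point's conditional distribution spreads its probability mass.
def perplexity(p):
    entropy = -sum(x * math.log2(x) for x in p if x > 0)  # in bits
    return 2 ** entropy

# Mass spread uniformly over 30 neighbors -> perplexity of 30.
uniform = [1 / 30] * 30
# Mass concentrated on one neighbor -> far fewer effective neighbors.
peaked = [0.7] + [0.3 / 29] * 29

print(perplexity(uniform))  # ~30.0
print(perplexity(peaked))
```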

UMAP

UMAP is better optimized for large datasets, performing the reduction in a fraction of the time tSNE requires. Many researchers prefer it for how its algorithm balances local and global information, keeping clusters farther apart and more distinct [@understanding_umap_2022].

UMAP uses a search radius to find the n_neighbors nearest points to each point. After constructing an initial graph of these connections in high-dimensional space, it projects the graph down to lower dimensions, where the minimum distance parameter controls how the connections are represented: a low minimum distance packs connected points closely together, whereas higher values spread the points out.
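The first of those two stages, building the nearest-neighbor graph, can be sketched in a few lines. This is a hedged pure-Python toy with made-up 2-D points standing in for word vectors, not UMAP's actual (approximate, optimized) implementation:

```python
import math

# Toy 2-D points standing in for word vectors (made-up values).
points = {"a": (0.0, 0.0), "b": (0.1, 0.0), "c": (0.0, 0.2),
          "d": (5.0, 5.0), "e": (5.1, 5.0)}

# Sketch of UMAP's first stage: link each point to its n_neighbors nearest
# points, producing the high-dimensional graph later laid out in 2D/3D.
def knn_graph(pts, n_neighbors):
    graph = {}
    for name, p in pts.items():
        ranked = sorted((math.dist(p, q), other)
                        for other, q in pts.items() if other != name)
        graph[name] = [other for _, other in ranked[:n_neighbors]]
    return graph

print(knn_graph(points, n_neighbors=2))
```

With n_neighbors=2, the two tight groups {a, b, c} and {d, e} only link internally, which is why small n_neighbors values emphasize local structure over global structure.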

Because UMAP skews the data to balance local and global representation, it is common to misinterpret random noise as a meaningful cluster. For this same reason, the distance between clusters is not meaningful in itself [@understanding_umap_2022].

Here, we try four configurations of UMAP in addition to the default, changing n_neighbors and min_dist.

You’ll notice that the UMAP default, as well as tSNE+PCA with low perplexity, delivered the most intuitive models in terms of visible clusters. Next, we will explore clustering and data-frame sorting to describe what relationships the dimensions seem to capture.

Clustering

As you saw in Dimensionality Reduction, projecting n dimensions onto a lower-dimensional subspace enables us to visualize where the words sit in a 3D representation. If the representation is meaningful, related words will generally be closer together: that is, the Euclidean distance between two points should indicate to what degree the words are related.
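That distance-as-relatedness idea is simple to state in code. A hedged Python toy with made-up 3-D "embeddings" (the vector values and the choice of example verbs are entirely illustrative):

```python
import math

# Toy 3-D "embeddings" (made-up values): two related verbs, one unrelated.
juosta  = [0.20, 0.85, 0.10]  # 'to run'
kavella = [0.25, 0.80, 0.15]  # 'to walk' (kävellä)
maksaa  = [0.90, 0.05, 0.70]  # 'to pay'

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# In a meaningful representation, the related pair is much closer together.
print(euclidean(juosta, kavella))  # small
print(euclidean(juosta, maksaa))   # large
```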

You probably noted chunks or clusters with gaps between them in the reduced data. One common, unsupervised method for finding these clusters is the K-means algorithm. While this can help us distinguish overall categories in our word set, hierarchical clustering breaks clusters into binary trees, resulting in more detailed groupings which can capture related pairs and smaller groups.

K-means

K-means is a ‘hard’ clustering method, meaning it assigns each datum to exactly one of k sets, calculating k cluster centroids such that the within-cluster variances are minimized. To find the ‘optimal’ number of clusters, I used the NbClust R package.
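The alternating assign-and-update procedure behind k-means (Lloyd's algorithm) can be sketched briefly. This is a pure-Python toy on made-up 2-D blobs, not the R workflow used here:

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal Lloyd's algorithm: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:  # assignments stable -> converged
            break
        centroids = new
    return centroids, clusters

# Two well-separated toy blobs (made-up 2-D stand-ins for embeddings).
pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, clusters = kmeans(pts, k=2)
print(sorted(centroids))
```

On this toy data the algorithm recovers the two blobs regardless of which points seed the centroids; on real embeddings, the choice of k and the random initialization matter much more.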

NbClust recommends a very low number of clusters, but I wanted to see more definition in how the words are grouped. I settled on k=12 so that clusters would be smaller and patterns hopefully more transparent. This is also convenient for visualization, as the largest discrete palettes in R contain 12 colors.

The data for each cluster is also posted below each chart. Try sorting column-wise to see which words end up near each other in each of the 3 dimensions: sometimes this is more apparent and useful than the plot itself, as words near each other in one dimension may be spread apart in another and fail to cluster well.

Notice that the semantic relationships in UMAP3 and 6 are not as clear in the tabular data, whereas the tSNE clusters and default UMAP sort nicely into related words in the tables. One can easily imagine the words which are listed next to each other appearing in the same paragraph of text.

DBSCAN

Density-based clustering can find clusters of different shapes and sizes [@datanovia_2018] without using centroids, which assume data within clusters is normally distributed.

Density-based clustering answers a different question: groups of points with a similar degree of relatedness are assigned to the same cluster, so clusters can be spread throughout the data. In each case, you can see the least dense ‘cluster’ stretching through the data points. As such, this captures points in groups per their ‘degree of relatedness’ rather than minimizing the distance from other individual points and assigning a centroid.
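The mechanics described above, growing clusters outward from dense points and leaving sparse points in a catch-all ‘0’ group, can be sketched compactly. This is a simplified pure-Python illustration of the density-based idea on made-up 2-D data, not the R implementation used in this document:

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal density-based clustering sketch: grow clusters from 'core'
    points that have at least min_pts neighbors within eps; points in
    sparse regions keep label 0, like the '0' cluster in the tables."""
    labels = {p: None for p in points}
    cluster = 0
    for p in points:
        if labels[p] is not None:
            continue
        neighbors = [q for q in points if math.dist(p, q) <= eps]
        if len(neighbors) < min_pts:
            labels[p] = 0  # noise: too sparse to seed a cluster
            continue
        cluster += 1
        labels[p] = cluster
        queue = [q for q in neighbors if q != p]
        while queue:
            q = queue.pop()
            if labels[q] in (None, 0):
                labels[q] = cluster
                q_neighbors = [r for r in points if math.dist(q, r) <= eps]
                if len(q_neighbors) >= min_pts:  # q is dense: keep expanding
                    queue.extend(r for r in q_neighbors if labels[r] is None)
    return labels

# Two dense toy blobs plus one isolated point (made-up 2-D data).
pts = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (5.0, 5.0), (5.1, 5.1), (9.0, 0.0)]
labels = dbscan(pts, eps=0.5, min_pts=2)
print(labels)
```

Note that, unlike k-means, no k is chosen in advance: the eps and min_pts density thresholds determine how many clusters emerge and which points are dismissed as noise.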

When looking at the sorted tables, this becomes even clearer: the largest, ‘0’ dbc cluster contains random words drawn from a mixture of the k-means clusters calculated previously, with no strong relationship amongst the items; whereas the smaller, ‘tighter’ dbc clusters toward the top of the data are very clearly related and are almost perfect subsets of the k-means clusters. This could be a good way of filtering noisy words from the data if you are only interested in highly related tokens.

Fuzzy Clustering

Fuzzy clustering, a ‘soft’ method, models uncertainty in cluster assignment: each data point is assigned a probability of belonging to each cluster. The fclust package implements fuzzy k-means. Here we use the default configuration with 12 clusters, so we can compare with the hard clustering methods above.
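The soft assignments come from a simple formula: each point's membership in a cluster falls off with its relative distance to that cluster's centroid. A hedged pure-Python sketch of the standard fuzzy k-means membership update (made-up distances; not fclust's code):

```python
# Fuzzy k-means membership: each point receives a probability of belonging
# to every cluster, derived from its distances to the centroids (fuzzifier
# m=2, a common default; distances are made up and must be non-zero).
def memberships(dists, m=2):
    exponent = 2 / (m - 1)
    return [1 / sum((d / other) ** exponent for other in dists)
            for d in dists]

# A point close to centroid 1 and far from centroids 2 and 3.
u = memberships([0.5, 4.0, 6.0])
print([round(x, 3) for x in u])  # memberships sum to 1, dominated by cluster 1
```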

For visualization, points were darkened or made transparent based on the certainty that they belong to their assigned cluster. But fclust also provides the probability that each point belongs to every other cluster, k*N values in total. In the data tables, tokens are sorted by cluster (fc) and then by the probability that they belong to that cluster (fc_prb), so the tokens at the top of each cluster group should have a high degree of relatedness.

If you know exactly how many clusters you are looking for and you are only concerned with the items which are highly related, this might be a good method. But many points and would-be clusters at the outer reaches of the centroids are attenuated to the point that they’re invisible.

Hierarchical Clustering

Hierarchical methods assign layers of clusters which can be represented in a dendrogram.

Visualizing hierarchical clusterings of large datasets is challenging. Interactive plots that let you ‘expand’ nodes by clicking tend to hide the data until we are deep within the hierarchical structure, clearly inappropriate for binary trees; given the size of our data, these are also very slow to compute and would take hundreds of clicks to reach the leaf nodes. Graphs that show the full tree structure tend to be hard to navigate and zoom for viewing detailed relationships. After some experimentation, I settled on a simple dendrogram from the phylo library, split into 12 subtrees.
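The bottom-up merging that produces such a dendrogram is easy to sketch. This is a hedged pure-Python toy of agglomerative clustering with single linkage (one of several linkage choices; made-up 2-D points, not the R code used here):

```python
import math

def single_linkage(points, target_clusters):
    """Minimal agglomerative clustering: repeatedly merge the two clusters
    whose closest members are nearest, until target_clusters remain.
    Each merge corresponds to one internal node of the dendrogram."""
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters

pts = [(0, 0), (0, 1), (10, 0), (10, 1), (20, 0)]
print(single_linkage(pts, target_clusters=3))
```

Stopping the merging early, as with target_clusters=3 here, is the same idea as cutting the full dendrogram into a fixed number of subtrees.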

My apologies for not including the native Finnish in these visualizations, which I found difficult due to spacing and formatting restrictions. After consulting a native speaker, it became clear that the original Finnish words are often more clearly thematically related than their English translations; out of context, this is difficult to detect when only given a superficial gloss.

A second, complex approach was to cluster the original k=12 kmeans clusters into subclusters, then cluster the 12 original centroids into superclusters, and assign RGB values based on functions of the 3-tiered cluster assignment.

It is questionable whether this method is in any way better than k-means performed with a higher k, but I did not take the opportunity to test this. Breaking up the k=12 clusters did result in more refined subclusters, although assigning colors to these clusters in a visually clear way is an additional challenge.

Summary

Findings

Compared to the embeddings used in the Finnish dialogue project, the relationships between words here are less obvious, at least to a non-native eye. In the case of tSNE with low perplexity and the default UMAP reduction, many clusters contain logical patterns:

  1. “discuss, prioritize, represent, document” represent formal business vocabulary.
  2. “threaten, violate, mock, hate” represent connotatively negative and aggressive words.
  3. “put, toss, fling, whip, peal” are transitive, physical actions related to cooking.
  4. “pump, flow, drill” are related to extraction of liquid from the earth.

However, these clusters as well as others contain noise and nonsensical outliers, while some clusters appear to contain unrelated words entirely. Perhaps a fluent speaker could see clearer relationships between some of the clustered words as they are typically used in context: I would certainly appreciate some commentary.

Personally, I prefer the split hierarchical clustering for showing nearest relationships and building these up into larger subgroups.

Limitations

fastText vectors are calculated from immediate context and subwords, which doesn’t account for synonymy in the theoretical sense or for global contextual relationships at the document level. Perhaps this is why the dialogue project delivered denser clusters: the words from the dialogue, representing a wide range of word classes and inflectional forms, collocate with higher probability, and the pretrained embeddings for those word forms are naturally closer together, as the tokens were taken directly from an authentic context.

While it makes sense intuitively that some nouns may cluster nicely, verbs, especially when limited to one inflectional form, tend to be ‘lone wolves’ in the sense that they are the scaffolding on which grammatical constructions are built and are less likely to co-occur with other lemmas in a computational window of limited length. This might explain why these clusters are less dense than the mixed tokens from the dialogue.

Although I wanted to push the limits of visualization from 2D to 3D, it’s not apparent that anything was gained by doing this. In fact, if the third dimension spreads out related items which would have otherwise been clustered, it may have actually introduced noise, especially if the variance captured by the third dimension was quite a bit lower than dimensions 1 and 2. For this reason, captured variance, which I only assessed for PCA, would be worth investigating further.

Also, it must be underscored that visualization, though it can help with insights, is no more informative (less informative, in my opinion) than a dataframe with multilevel sorting in the case of word vectors. Dataframes can show us Gaussian clusters along each of the reduced dimensions independently, giving insight into what semantic features the dimension itself approximates. When plotted in 2D or 3D, the visual interaction of these features can dilute the strength of their relationships along a single dimension. Below is a multiple-Gaussian clustering of one component using the mixtools library.
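Fitting Gaussians along a single reduced dimension is typically done with expectation-maximization, which is what mixtools uses under the hood. A hedged pure-Python sketch of EM for a two-component 1-D mixture, on made-up data with two obvious bumps (a toy, not mixtools' implementation):

```python
import math

def em_1d(xs, iters=50):
    """Minimal EM for a two-component 1-D Gaussian mixture."""
    # Crude initialization: split the sorted data at the median.
    xs = sorted(xs)
    half = len(xs) // 2
    mu = [sum(xs[:half]) / half, sum(xs[half:]) / (len(xs) - half)]
    var = [1.0, 1.0]
    weight = [0.5, 0.5]

    def pdf(x, m, v):
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            w = [weight[k] * pdf(x, mu[k], var[k]) for k in range(2)]
            total = sum(w)
            resp.append([wk / total for wk in w])
        # M-step: re-estimate mixing weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weight[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, xs)) / nk, 1e-6)
    return mu, var, weight

# Two made-up Gaussian bumps along one reduced dimension.
data = [-2.2, -2.0, -1.9, -2.1, -1.8, 2.0, 2.1, 1.9, 2.2, 1.8]
mu, var, weight = em_1d(data)
print(sorted(round(m, 2) for m in mu))  # means recovered near -2 and 2
```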

References

Please see the .bibtex file in the GitHub repository.